Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors
Abstract
The rapid introduction of the Intel Core™2 Duo and Quad processors to the mass market has drawn attention to threadization (a.k.a. parallelization) and vectorization of existing code in many application domains. In fact, multi-core processor vendors are eager to enable their users to exploit various levels of parallelism in order to harness the additional compute resources of multi-core processors. The Intel C++/Fortran compiler provides an essential tool for unleashing the power of Intel Core 2 Duo and Quad processors. This is accomplished by means of high-level loop optimizations and scalar optimizations that exploit multi-core processors and single-instruction-multiple-data (SIMD) instructions, combined with advanced code generation built on an intimate knowledge of microarchitectural performance aspects. In this paper we outline the design and implementation of a new threadizer and vectorizer inside the Intel 10.1 compilers, and we also provide an overview of the enhanced high-level loop optimizations and the low-level code generation used to obtain higher performance on platforms based on Intel Core 2 Duo and Quad processors. Significant performance gains are shown using the SPEC CPU2006 suite running on a system configured with two Intel quad-core processors.

INTRODUCTION

The aggressive delivery of Intel multi-core processors to the mass computer market shows that, as the performance improvements from continuously increasing clock frequencies start to taper off, other architectural advances that reduce latency or increase memory bandwidth are gaining importance [9]. In particular, since packaging densities are still growing, integrating multiple processors on a single die and using SIMD extensions are becoming more widespread [1].
The Intel Core 2 Duo and Quad processors are equipped with a rich set of microarchitectural and architectural features to boost performance:
• dual-core or quad-core on a single chip
• wider execution units for Streaming SIMD Extensions (SSE, SSE2, SSE3)
• a set of new instructions referred to as Supplemental Streaming SIMD Extensions 3 (SSSE3)
• advanced smart shared L2 cache among cores on the same chip

Due to the complexity of modern processors, compiler support has become an important part of obtaining higher performance. Most importantly, to assist programmers in leveraging all parallel capabilities of Intel's new processors, the Intel C++/Fortran compiler provides an essential tool for unleashing the power of Intel multi-core processors and SIMD instructions by means of high-level optimizations and advanced code generation. The Intel compilers perform automatic optimizations of programs using threadization [10], vectorization [1, 2, 5], classical loop transformations (e.g., distribution, unrolling, interchange, fusion) [7, 11, 12], scalar optimizations such as constant propagation, Partial Dead Store Elimination (PDSE), Partial Redundancy Elimination (PRE), copy propagation, Inter-Procedural Optimizations (IPO) [7], and advanced machine code generation techniques that together yield a significant performance gain compared to the default level of optimization.

Intel Technology Journal, Volume 11, Issue 4, 2007

The contributions of the new threadizer and vectorizer are as follows:
• The new threadizer yields up to a 4.63x speedup (with 8 cores) by exploiting thread-level parallelism from a serial program in the SPEC CPU2006 benchmark suites. Overall, auto-threadization delivers a 15.45% gain (geomean with 8 cores) for the SPEC CFP2006 suite and a 12.17% gain (geomean with 8 cores) for the SPEC CINT2006 suite.
• The new vectorizer yields up to a 1.28x speedup by exploiting SIMD-type vector parallelism from a serial program in the SPEC CPU2006 suites. Overall, auto-vectorization delivers a 5.11% gain (geomean) for the SPEC CFP2006 suite and a 2.01% gain (geomean) for the SPEC CINT2006 suite.

The rest of this paper is organized as follows. First, we provide some basics on the Intel Core™ microarchitecture. Then, we discuss the design and implementation of the new threadizer and vectorizer, respectively, inside the Intel 10.1 compilers. Subsequently, we discuss the loop optimizations and enhancements made to support efficient threadization and vectorization. We also present an overview of advanced code generation for the Intel Core 2 Duo and Quad processors. Finally, we provide performance results using the SPEC CPU2006 industry-standard benchmark suite built with the Intel 10.1 C++ and Fortran compilers.

INTEL CORE™ MICROARCHITECTURE

Intel Core microarchitecture is the foundation for all new Intel architecture-based desktop, mobile, and server multi-core processors. This state-of-the-art multi-core processor with optimized microarchitecture delivers a number of innovative features that have set new standards for energy-efficient performance. In this section we outline a few innovations relevant to this paper. A more detailed description can be found in the Intel literature [4].

Figure 1: Quad-core processor schematic

Figure 1 shows a schematic of the Intel Core 2 Quad processor. Two independent cores with their own private L1 caches reside on a single die. The two Level 2 (L2) caches, referred to as the Intel Advanced Smart Cache, are shared between cores so that data are stored in one place accessible by the cores. Sharing the L2 cache enables a core to dynamically use up to 100% of the available L2 cache, thus optimizing cache resources.
The quad-core processor is equipped with Intel Smart Memory Access techniques that boost system performance by optimizing available data bandwidth from the memory subsystem and hiding the latency of memory accesses through two techniques: memory disambiguation and an instruction pointer-based prefetcher that fetches memory contents to the shared L2 cache and then into each private L1 cache before they are requested. The data prefetcher can detect strided memory access patterns to make accurate predictions about future load addresses.

Another key feature of Intel Core microarchitecture is the Intel Advanced Digital Media Boost that can issue 128-bit SSE instructions with a throughput of one per clock cycle. Previous-generation Intel processors had a sustained throughput of one instruction per two clock cycles, typically one cycle for the lower 64 bits followed by another cycle for the upper 64 bits. By widening execution units to the full 128 bits, the Intel processor effectively doubles the performance of a series of 128-bit SSE instructions relative to previous-generation Intel processors. In addition, the latency of various individual 128-bit SSE instructions has been reduced, and SSSE3 has been added to extend the instruction set. As a result, more overall performance improvements can be expected from vectorization (i.e., transforming sequential code into SIMD instructions).

REVAMPING THE THREADIZER

In this section, we present our new threadizer framework that is highly integrated with our classical high-level loop optimizations, and we describe its main components. The strengths of the new threadizer include the following:
• A new Abstract Thread Representation (ATR), based on the concept of virtual threads, is designed to bridge the semantic gap between high-level representation and physical (hardware or OS) threads.
• Better interaction with other high-level loop-related optimizations gives better performance.
• The new threadizer is moved downstream to take advantage of scalar optimizations such as global constant propagation and Static Single Assignment (SSA) PRE, and some loop optimizations.
• A table-driven cost model simplifies maintenance and future extensibility.
• Effective runtime threadization control and multiple schedule types such as static, dynamic, guided, and runtime are supported.

The threadizer in the Intel compiler serves as a single module that covers different languages (C++ and Fortran), architectures (IA-32, Intel 64, and IA-64), and operating systems (Microsoft Windows*, Linux*, and MacOS*).
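
The schedule types listed above correspond to the schedule kinds of OpenMP, so the kind of loop a threadizer targets can be illustrated with an explicit OpenMP pragma. The following is a minimal sketch under stated assumptions, not the compiler's internal output: the function names `scale_add` and `checksum` are hypothetical, and the pragma is written by hand here, whereas the threadizer described in this paper makes the equivalent decision automatically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * Every iteration reads and writes only its own element, and the
 * restrict qualifiers promise that x and y do not overlap. Loop
 * independence of this kind is what a threadizer has to establish
 * before it can split iterations across threads. */
void scale_add(double *restrict y, const double *restrict x,
               double a, int n)
{
    assert(n >= 0);
    /* schedule(runtime) defers the choice among static, dynamic, and
     * guided scheduling to the OMP_SCHEDULE environment variable.
     * When OpenMP support is not enabled, the pragma is ignored and
     * the loop runs serially with the same result. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical helper: serial sum used to check the result. */
double checksum(const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += y[i];
    }
    return sum;
}
```

Calling `scale_add(y, x, 2.0, n)` from a driver program and comparing `checksum` across runs with different `OMP_SCHEDULE` settings is one way to confirm that the schedule type changes only how iterations are distributed, not the computed values.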